Vector-thread architecture and implementation

Author

Ronny Krashinsky

Abstract

This thesis proposes vector-thread architectures as a performance-efficient solution for all-purpose computing. The VT architectural paradigm unifies the vector and multithreaded compute models. VT provides the programmer with a control processor and a vector of virtual processors (VPs). The control processor can use vector-fetch commands to broadcast instructions to all the VPs, or each VP can use thread-fetches to direct its own control flow. A seamless intermixing of the vector and threaded control mechanisms allows a VT architecture to flexibly and compactly encode application parallelism and locality. VT architectures can efficiently exploit a wide variety of loop-level parallelism, including non-vectorizable loops with cross-iteration dependencies or internal control flow.

The Scale VT architecture is an instantiation of the vector-thread paradigm designed for low-power and high-performance embedded systems. Scale includes a scalar RISC control processor and a four-lane vector-thread unit that can execute 16 operations per cycle and supports up to 128 simultaneously active virtual processor threads. Scale provides unit-stride and strided-segment vector loads and stores, and it implements cache refill/access decoupling. The Scale memory system includes a four-port, non-blocking, 32-way set-associative, 32 KB cache. A prototype Scale VT processor was implemented in 180 nm technology using an ASIC-style design flow. The chip has 7.1 million transistors and a core area of 16.6 mm², and it runs at 260 MHz while consuming 0.4–1.1 W.

This thesis evaluates Scale using a diverse selection of embedded benchmarks, including example kernels for image processing, audio processing, text and data processing, cryptography, network processing, and wireless communication. Larger applications also include a JPEG image encoder and an IEEE 802.11a wireless transmitter. Scale achieves high performance on a range of different types of codes, generally executing 3–11 compute operations per cycle. Unlike other architectures that improve performance at the expense of increased energy consumption, Scale is generally even more energy efficient than a scalar RISC processor.

Thesis Supervisor: Krste Asanović
Title: Associate Professor
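The interplay of vector-fetch and thread-fetch described in the abstract can be illustrated with a small software model. The sketch below is only a toy, not the Scale instruction set or the thesis's implementation: the names `VirtualProcessor`, `ControlProcessor`, `vector_fetch`, and `count_halvings` are hypothetical, instruction blocks are modelled as plain Python functions, and the VPs are stepped one after another rather than in parallel. It shows a loop with data-dependent internal control flow, where each VP keeps thread-fetching the loop body until its own element is done.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical model: an instruction block is a function that runs on one VP
# and returns the next block to thread-fetch, or None when the VP stops.
Block = Callable[["VirtualProcessor"], Optional["Block"]]


@dataclass
class VirtualProcessor:
    index: int
    regs: Dict[str, int] = field(default_factory=dict)

    def run(self, block: Optional[Block]) -> None:
        # Execute the block received from the control processor, then keep
        # following this VP's own thread-fetches until it issues none.
        while block is not None:
            block = block(self)


class ControlProcessor:
    def __init__(self, num_vps: int) -> None:
        self.vps: List[VirtualProcessor] = [
            VirtualProcessor(i) for i in range(num_vps)
        ]

    def vector_fetch(self, block: Block) -> None:
        # Broadcast the same block to every VP (sequentially in this toy model).
        for vp in self.vps:
            vp.run(block)


def count_halvings(data: List[int]) -> List[int]:
    """For each element, count how many integer halvings bring it to 1 or below."""

    def halve_step(vp: VirtualProcessor) -> Optional[Block]:
        # Per-VP control flow: stop, or thread-fetch this block again.
        if vp.regs["x"] <= 1:
            return None
        vp.regs["x"] //= 2
        vp.regs["steps"] += 1
        return halve_step

    def load(vp: VirtualProcessor) -> Optional[Block]:
        # Vector-fetched block: each VP loads its own element, then branches
        # into the data-dependent loop body with a thread-fetch.
        vp.regs["x"] = data[vp.index]
        vp.regs["steps"] = 0
        return halve_step

    cp = ControlProcessor(len(data))
    cp.vector_fetch(load)
    return [vp.regs["steps"] for vp in cp.vps]
```

For example, `count_halvings([1, 6, 7, 40])` returns `[0, 2, 2, 5]`: the control processor issues a single vector-fetch, and each VP then runs a different number of loop iterations on its own.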

Similar resources

The Instruction Execution Mechanism for Responsive Multithreaded Processor

This paper describes the instruction execution mechanism of Responsive Multithreaded (RMT) Processor for distributed real-time processing. The execution order of each thread is controlled by using priority in RMT Processor. The highest priority thread is executed first in RMT Processor. Real-time applications, such as soft real-time processing including multimedia processing, require high compu...

The implementation of the parallel shortest vector enumerate in the block Korkin-Zolotarev method

This article presents a parallel CPU implementation of Kannan's algorithm for solving the shortest vector problem in the block Korkin-Zolotarev lattice reduction method. The implementation is based on the Native POSIX Thread Library and shows a linear decrease in runtime with the number of threads.

A SIMD Approach to Thread Matching for Interleaved Multithreading

Interleaved multithreading processors offer improved performance and power efficiency in a multithreading environment compared to standard CPUs by allowing multiple threads to share a single processing pipeline. However, resource contention is a natural result of such a system and can determine how well the overall thread group performs on the processor. Selecting threads which perform well tog...

Design and Implementation of Digital Demodulator for Frequency Modulated CW Radar (RESEARCH NOTE)

Radar Signal Processing has been an interesting area of research for realization of programmable digital signal processor using VLSI design techniques. Digital Signal Processing (DSP) algorithms have been an integral design methodology for implementation of high speed application specific real-time systems especially for high resolution radar. CORDIC algorithm, in recent times, is turned out to...

Prefix Computations on Symmetric Multiprocessors (Preliminary Draft)

We introduce a new optimal prefix computation algorithm on linked lists which builds upon the sparse ruling set approach of Reid-Miller and Blelloch. Besides being somewhat simpler and requiring nearly half the number of memory accesses, we can bound our complexity with high probability instead of merely on average. Moreover, whereas Reid-Miller and Blelloch targeted their algorithm for implemen...

Publication date: 2007
